Thinking Hallucination for Video Captioning

Authors

Abstract

With the advent of rich visual representations and pre-trained language models, video captioning has seen continuous improvement over time. Despite the performance improvement, video captioning models are prone to hallucination. Hallucination refers to the generation of highly pathological descriptions that are detached from the source material. In video captioning, there are two kinds of hallucination: object and action hallucination. Instead of endeavoring to learn better representations of a video, in this work, we investigate the fundamental sources of the hallucination problem. We identify three main factors: (i) inadequate visual features extracted from pre-trained models, (ii) improper influences of source and target contexts during multi-modal fusion, and (iii) exposure bias in the training strategy. To alleviate these problems, we propose two robust solutions: (a) the introduction of auxiliary heads trained in multi-label settings on top of the extracted visual features, and (b) the addition of context gates, which dynamically select the features during fusion. The standard evaluation metrics for video captioning measure similarity with ground-truth captions but do not adequately capture object and action relevance. To this end, we propose a new metric, COAHA (caption object and action hallucination assessment), which assesses the degree of hallucination. Our method achieves state-of-the-art performance on the MSR-Video to Text (MSR-VTT) and Microsoft Research Video Description Corpus (MSVD) datasets, especially by a massive margin in CIDEr score.
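The two proposed remedies lend themselves to a compact illustration. The sketch below shows, under assumed layer sizes and tensor shapes (none of which come from the paper), a multi-label auxiliary head over pre-extracted visual features and a context gate that balances visual evidence against the language context during fusion; MultiLabelAuxHead, ContextGate, and all dimensions are hypothetical names chosen for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiLabelAuxHead(nn.Module):
    """Hypothetical auxiliary head: multi-label classification (e.g. which
    objects/actions appear in the clip) on top of pre-trained visual
    features, nudging them toward caption-relevant semantics."""

    def __init__(self, feat_dim: int, num_labels: int):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_labels)
        self.loss_fn = nn.BCEWithLogitsLoss()

    def forward(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # labels is a multi-hot {0, 1} tensor of shape (batch, num_labels).
        return self.loss_fn(self.classifier(feats), labels)

class ContextGate(nn.Module):
    """Hypothetical context gate: a sigmoid gate, conditioned on both
    modalities, decides per dimension how much visual context versus
    target (language) context enters the fused representation."""

    def __init__(self, visual_dim: int, target_dim: int, fused_dim: int):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, fused_dim)
        self.target_proj = nn.Linear(target_dim, fused_dim)
        self.gate = nn.Linear(visual_dim + target_dim, fused_dim)

    def forward(self, visual: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([visual, target], dim=-1)))
        # g -> 1 favours visual evidence; g -> 0 favours the language context.
        return g * self.visual_proj(visual) + (1.0 - g) * self.target_proj(target)

# Toy usage: a 2048-d clip feature, a 512-d decoder state, 300 aux labels.
aux = MultiLabelAuxHead(feat_dim=2048, num_labels=300)
gate = ContextGate(visual_dim=2048, target_dim=512, fused_dim=512)
feats, state = torch.randn(4, 2048), torch.randn(4, 512)
aux_loss = aux(feats, torch.randint(0, 2, (4, 300)).float())
fused = gate(feats, state)  # shape: (4, 512)
```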


Similar Articles

Reconstruction Network for Video Captioning

In this paper, the problem of describing visual contents of a video sequence with natural language is addressed. Unlike previous video captioning work mainly exploiting the cues of video contents to make a language description, we propose a reconstruction network (RecNet) with a novel encoder-decoder-reconstructor architecture, which leverages both the forward (video to sentence) and backward (...
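As a rough illustration of the reconstructor idea, the sketch below regenerates a global video feature from the decoder's hidden states and penalizes the gap with an MSE loss; the LSTM reconstructor, mean pooling, and all dimensions are assumptions made for illustration, not RecNet's actual architecture.

```python
import torch
import torch.nn as nn

class Reconstructor(nn.Module):
    """Sketch of a backward (sentence-to-video) path: rebuild the global
    video feature from the caption decoder's hidden states, forcing the
    caption to retain information about the source video."""

    def __init__(self, hidden_dim: int, feat_dim: int):
        super().__init__()
        self.rnn = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.to_feat = nn.Linear(hidden_dim, feat_dim)

    def forward(self, decoder_states: torch.Tensor) -> torch.Tensor:
        out, _ = self.rnn(decoder_states)      # (B, T, H)
        return self.to_feat(out.mean(dim=1))   # pooled over time -> (B, feat_dim)

# Reconstruction loss: MSE against the original mean-pooled video feature.
recon = Reconstructor(hidden_dim=512, feat_dim=2048)
states = torch.randn(4, 12, 512)   # decoder hidden states over 12 steps
video_feat = torch.randn(4, 2048)  # mean-pooled frame features
loss = nn.functional.mse_loss(recon(states), video_feat)
```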


Consensus-based Sequence Training for Video Captioning

Captioning models are typically trained using the cross-entropy loss. However, their performance is evaluated on other metrics designed to better correlate with human assessments. Recently, it has been shown that reinforcement learning (RL) can directly optimize these metrics in tasks such as captioning. However, this is computationally costly and requires specifying a baseline reward at each st...
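The RL recipe this snippet alludes to is essentially REINFORCE with a baseline reward. A minimal sketch follows, taking summed caption log-probabilities and per-sample metric scores (e.g. CIDEr) as inputs; all values are dummies for illustration.

```python
import torch

def reinforce_loss(log_probs: torch.Tensor,
                   sample_reward: torch.Tensor,
                   baseline_reward: torch.Tensor) -> torch.Tensor:
    """REINFORCE with a baseline: minimize -(r - b) * log p(caption).
    log_probs: (B,) summed token log-probabilities of sampled captions.
    sample_reward / baseline_reward: (B,) metric scores, e.g. CIDEr."""
    advantage = sample_reward - baseline_reward  # baseline reduces variance
    return -(advantage.detach() * log_probs).mean()

# Dummy example: a sample scoring above the baseline gets reinforced.
lp = torch.tensor([-12.3, -9.8], requires_grad=True)
loss = reinforce_loss(lp, torch.tensor([0.9, 0.4]), torch.tensor([0.5, 0.5]))
loss.backward()
```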


Deep Learning for Video Classification and Captioning

Accelerated by the tremendous increase in Internet bandwidth and storage space, video data has been generated, published and spread explosively, becoming an indispensable part of today's big data. In this paper, we focus on reviewing two lines of research aiming to stimulate the comprehension of videos with deep learning: video classification and video captioning. While video classification con...


Multimodal Memory Modelling for Video Captioning

Video captioning, which automatically translates video clips into natural language sentences, is a very important task in computer vision. By virtue of recent deep learning technologies, e.g., convolutional neural networks (CNNs) and recurrent neural networks (RNNs), video captioning has made great progress. However, learning an effective mapping from visual sequence space to language space is st...


Grounded Objects and Interactions for Video Captioning

We address the problem of video captioning by grounding language generation on object interactions in the video. Existing work mostly focuses on overall scene understanding with often limited or no emphasis on object interactions to address the problem of video understanding. In this paper, we propose SINet-Caption that learns to generate captions grounded over higher-order interactions between...
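One simple way to realize "higher-order interactions between objects" is to score all ordered pairs of detected-object features and attend over them; the sketch below does exactly that, with assumed dimensions, and is a generic pairwise-attention illustration rather than SINet-Caption's actual formulation.

```python
import torch
import torch.nn as nn

class PairwiseInteraction(nn.Module):
    """Score every ordered pair of object features, then attend over the
    pairs to produce one interaction vector for the caption decoder."""

    def __init__(self, obj_dim: int, hidden_dim: int):
        super().__init__()
        self.pair_mlp = nn.Sequential(nn.Linear(2 * obj_dim, hidden_dim),
                                      nn.ReLU())
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, objs: torch.Tensor) -> torch.Tensor:
        B, N, D = objs.shape
        # Build all N*N ordered pairs (i, j) by broadcasting.
        a = objs.unsqueeze(2).expand(B, N, N, D)
        b = objs.unsqueeze(1).expand(B, N, N, D)
        pairs = self.pair_mlp(torch.cat([a, b], dim=-1)).reshape(B, N * N, -1)
        attn = torch.softmax(self.score(pairs), dim=1)  # (B, N*N, 1)
        return (attn * pairs).sum(dim=1)                # (B, hidden_dim)

feats = torch.randn(2, 5, 256)  # 5 detected objects per clip
print(PairwiseInteraction(256, 128)(feats).shape)  # torch.Size([2, 128])
```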



Journal

Journal title: Lecture Notes in Computer Science

Year: 2023

ISSN: 1611-3349, 0302-9743

DOI: https://doi.org/10.1007/978-3-031-26316-3_37